# Auto _Ch 02 - Q9 (applied)_

__Description__

Gas mileage, horsepower, and other information for 392 vehicles.

__Source__

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

__References__

This dataset is part of the course material of the [book](https://www.statlearning.com/): ___Introduction to Statistical Learning with R___ (Ch 02 - Statistical Learning - Applied Exercises - Problem 9)
__Short description of variables__

- <b>mpg :</b> miles per gallon
- <b>cylinders :</b> Number of cylinders between 4 and 8
- <b>displacement :</b> Engine displacement (cu. inches)
- <b>horsepower :</b> Engine horsepower
- <b>weight :</b> Vehicle weight (lbs.)
- <b>acceleration :</b> Time to accelerate from 0 to 60 mph (sec.)
- <b>year :</b> Model year (modulo 100)
- <b>origin :</b> Origin of car (1. American, 2. European, 3. Japanese)
- <b>name :</b> Vehicle name
<a id='toc'></a>
### Index

- [1) Load packages](#1%29-Load-packages)
- [2) Import Data](#2%29-Import-Data)
- [3) Data preparation](#3%29-Data-preparation)
- [a) Which of the predictors are quantitative, and which are qualitative?](#(a%29-Which-of-the-predictors-are-quantitative,-and-which-are-qualitative?)
- [b) What is the range of each quantitative predictor?](#(b%29-What-is-the-range-of-each-quantitative-predictor?)
- [c) What is the mean and standard deviation of each quantitative predictor?](#(c%29-What-is-the-mean-and-standard-deviation-of-each-quantitative-predictor?)
- [d) Range, mean and standard deviation after removing observations 10-85](#(d%29-Range,-mean-and-standard-deviation-after-removing-observations-10-85)
- [e) Graphical examination of predictors](#(e%29-Graphical-examination-of-predictors)
- [f) Variables useful in predicting mpg](#(f%29-Variables-useful-in-predicting-mpg)

###### ------------------------------------------------------------

```python
# Load requisite packages
import os
import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

# plt.rcParams['figure.dpi'] = 120
sns.set_style(rc={"axes.facecolor": "w", 'figure.facecolor': 'w'})
# plt.rcParams.update({'font.size': 10, "axes.titlesize": 11, 'axes.labelsize': 11})

# Function to raise a UserWarning (suppressed later inside warnings.catch_warnings())
def fxn():
    warnings.warn("UserWarning arose", UserWarning)
```

###### ------------------------------------------------------------

```python
# File path
fdir = r"E:\Data Science\Statistics\Intro to Statistical Learning with R"
fpath = os.path.join(fdir, 'datasets', 'Auto.csv')
os.path.exists(fpath)

# Import data
dfo = pd.read_csv(fpath)
print(dfo.shape)
dfo.head(3)
```

```python
# Check for missing data
dfo.isna().sum().sum()
```

###### ------------------------------------------------------------

```python
# Create a copy that can be modified, so that the original is preserved if required
df = dfo.copy(deep=True)
```

```python
# Check whether the variables have been saved per their nature
df.info()
```

The fact that a column containing numbers (horsepower) has been saved as dtype 'object' is a red flag, as the 'object' dtype can hold strings as well as numerals. This column will have to be examined further.
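Such placeholders can also be handled at import time: `read_csv` accepts an `na_values` argument that converts a chosen marker to NaN during parsing, so the column comes in numeric straight away. A minimal sketch, using an inline two-row CSV as a stand-in for Auto.csv (which is assumed to mark missing values with '?'):

```python
import io
import pandas as pd

# Inline stand-in for Auto.csv; the real file would be read the same way
csv = io.StringIO("mpg,horsepower\n18.0,130\n25.0,?\n")

# Declaring '?' as a missing-value marker lets pandas parse horsepower as numeric
df_na = pd.read_csv(csv, na_values='?')
print(df_na['horsepower'].dtype)        # float64, not object
print(df_na['horsepower'].isna().sum()) # 1
```

With this approach `isna()` would report the '?' entries directly, instead of them hiding inside an 'object' column.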
```python
# Rows with non-numeric values
# A column saved as 'object' is not exactly a string, so the .str accessor has to be used
df[~df['horsepower'].str.isnumeric()]
```

Since the number of missing values is very small, those rows can simply be deleted.
```python
# Rows with '?' in horsepower
df[~df.horsepower.str.isnumeric()]
```

```python
# Rows with '?' in any column
df[df.applymap(lambda x: x == '?').any(axis=1)]

# Delete rows with '?'
df.drop(df.index[~df.horsepower.str.isnumeric()], axis=0, inplace=True)
df.shape
```
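`applymap` scans cell by cell with a Python-level lambda (and newer pandas releases deprecate it in favour of `DataFrame.map`); the same any-column check can be written with the vectorised `DataFrame.eq`. A sketch on a small stand-in frame (not the actual `df`):

```python
import pandas as pd

# Small stand-in frame with one '?' placeholder
d = pd.DataFrame({'horsepower': ['130', '?', '95'],
                  'mpg': [18.0, 25.0, 24.0]})

# Element-wise equality against '?' across all columns, no lambda needed;
# numeric columns simply compare False everywhere
mask = d.eq('?').any(axis=1)
print(d[mask])          # the row containing '?'
print(d[~mask].shape)   # shape after dropping it
```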
```python
# Convert horsepower to numeric
df['horsepower'] = pd.to_numeric(df.horsepower, errors='coerce')
df.horsepower.dtype
```

###### ------------------------------------------------------------

### (a) Which of the predictors are quantitative, and which are qualitative?

*Quantitative* → numerical values.
*Qualitative* → values in one of K different classes, or categories.
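Once horsepower is numeric, a first-pass split can also be made programmatically with `select_dtypes`. A sketch on a toy frame (column names chosen to mirror the dataset):

```python
import pandas as pd

# Toy frame mimicking the cleaned dataset's dtypes
t = pd.DataFrame({'mpg': [18.0, 25.0],
                  'horsepower': [130, 95],
                  'name': ['chevrolet', 'toyota']})

quant_cols = t.select_dtypes(include='number').columns.tolist()
qual_cols = t.select_dtypes(exclude='number').columns.tolist()
print(quant_cols)  # ['mpg', 'horsepower']
print(qual_cols)   # ['name']
```

The caveat, as the unique-value counts below make clear, is that dtype alone misclassifies numerically coded categoricals such as cylinders and origin, so the mechanical split still needs a manual review.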
```python
%%html
<style> table {margin-left: 0 !important;}</style>
```

```python
# No. of unique values in each column
for col in df.columns:
    print(f'{col} : {df[col].nunique()}')
```

| variable | description | variable type |
|---|---|---|
| mpg | miles per gallon | quantitative |
| cylinders | Number of cylinders between 4 and 8 | qualitative or categorical |
| displacement | Engine displacement (cu. inches) | quantitative |
| horsepower | Engine horsepower | quantitative |
| weight | Vehicle weight (lbs.) | quantitative |
| acceleration | Time to accelerate from 0 to 60 mph (sec.) | quantitative |
| year | Model year (modulo 100) | quantitative |
| origin | Origin of car (1. American, 2. European, 3. Japanese) | qualitative or categorical |
| name | Vehicle name | qualitative or categorical |
"year" can be considered to be quantitative in the sense that it could indirectly reflect the impact of technological abilities of the times, otherwise it can be considered qualitative (categorical).
###### ------------------------------------------------------------

### (b) What is the range of each quantitative predictor?

```python
# Quantitative predictors
quant_data = df[df.columns[~df.columns.isin(['cylinders', 'origin', 'name'])]]
quant_data.columns.tolist()
```

```python
# 5-number summary
df_summary = quant_data.describe()
summ = df_summary.loc[['min', 'max']]
summ.loc['range'] = summ.loc['max'] - summ.loc['min']
summ
```

###### ------------------------------------------------------------

### (c) What is the mean and standard deviation of each quantitative predictor?

```python
for i in ['mean', 'std']:
    summ.loc[i] = df_summary.loc[i]
summ.round(2)
```

###### ------------------------------------------------------------

### (d) Range, mean and standard deviation after removing observations 10-85

```python
# Remove rows 10-85
df1 = quant_data.drop(quant_data.index[9:85], axis=0)
# df1 = quant_data.iloc[~quant_data.index.isin(range(9, 86))]
# To remove obs 10-85: range(10-1, 85-1+1) >> range(9, 85) >> 76 rows.
# But range() behaves differently within .isin(), removing only 75 rows (up to obs 84);
# to remove through the 85th obs, .isin(range(9, 86)) has to be used.
df1.shape
```

```python
# Check that the 10th row of the filtered dataset matches the 86th row of the unfiltered dataset
(df1.iloc[9] == quant_data.iloc[85]).all()
```

```python
df1_summary = df1.describe()
summ1 = df1_summary.loc[['min', 'max']]
summ1.loc['range'] = summ1.loc['max'] - summ1.loc['min']
for i in ['mean', 'std']:
    summ1.loc[i] = df1_summary.loc[i]
summ1.round(3)
```

```python
# Check sums
df1.apply(lambda x: sum(x))
```

###### ------------------------------------------------------------

### (e) Graphical examination of predictors

```python
# Pair plot - hue: cylinders
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
    fig = sns.PairGrid(df, vars=df.columns[~df.columns.isin(['cylinders', 'origin', 'name'])].tolist(),
                       hue='cylinders')
    plt.gcf().set_size_inches(17, 15)
    fig.map_diag(sns.histplot)
    fig.map_upper(sns.scatterplot)
    fig.map_lower(sns.kdeplot)
    fig.add_legend(ncol=5, loc=1, bbox_to_anchor=(0.5, 1.05), borderaxespad=0, frameon=False);
```

<div class="alert alert-block alert-info"><a id=''></a><b>Observations:</b><br>
- acceleration appears to be roughly normally distributed.<br>
- An increase in the number of cylinders leads to<br>
&emsp;• lower : mpg, acceleration<br>
&emsp;• higher : displacement, horsepower, weight<br>
- There has been a decline in the number of new models coming out with 8 cylinders.<br>
- Newer models are lighter and have less horsepower (presumably because of the decreased weight).<br>
- mpg appears to have strong (non-linear) relationships with displacement, horsepower and weight, and is negatively correlated with all three.<br>
- Fuel efficiency of new models has improved over the years. Minimum mpg has almost doubled in the span of 12 years.<br>
&emsp;The Mazda GLC, of Japanese origin, is the car with the highest mpg, 44.6, and came out in 1980.<br>
- displacement, horsepower and weight appear to have strong positive correlations with each other.<br>
- A moderate negative correlation may exist between horsepower and acceleration.<br>
</div>

```python
# Convert the cylinders column to a categorical type before using it for 'color'
df.cylinders = df.cylinders.astype('category')

# Scatter plot - cylinders as hue
pal = ['#fdc086', '#386cb0', '#beaed4', '#33a02c', '#f0027f']
col_map = dict(zip(sorted(df.cylinders.unique()), pal))
fig = px.scatter(df, y='mpg', x='year', color='cylinders', color_discrete_map=col_map,
                 hover_data=['name', 'origin'])
fig.update_layout(width=800, height=400, plot_bgcolor='#fff')
fig.update_traces(marker=dict(size=8, line=dict(width=0.2, color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()
```
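The pairwise relationships read off the plots can be quantified with a correlation matrix. A sketch on two synthetic columns exhibiting the pattern noted above (weight up, mpg down); the real `quant_data` frame would be passed instead:

```python
import pandas as pd

# Synthetic stand-in: heavier cars, lower mileage
toy = pd.DataFrame({'weight': [2000, 2500, 3000, 3500],
                    'mpg': [35.0, 30.0, 22.0, 15.0]})

# Pairwise Pearson correlations; off-diagonal entry is strongly negative here
corr = toy.corr()
print(corr.round(2))
```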
```python
# Convert the origin column to a categorical type before using it for 'color'
df.origin = df.origin.astype('category')

# Scatter plot - origin as hue
symb = ['triangle-up', 'circle', 'x']
symb_map = dict(zip(sorted(df.origin.unique()), symb))
fig = px.scatter(df, y='mpg', x='year', color='origin', hover_data=['name', 'cylinders'],
                 color_discrete_sequence=px.colors.qualitative.Set1,
                 symbol='origin', symbol_map=symb_map)
fig.update_layout(title=dict(text="Scatter Plot", xanchor='left', x=0.4, yanchor="top", y=0.99),
                  width=700, height=450, plot_bgcolor='#fff', showlegend=True,
                  legend=dict(orientation="h", xanchor="left", x=0.3, yanchor='top', y=1))
fig.update_traces(marker=dict(size=8, line=dict(width=0.2, color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()
# dir(px.colors.qualitative)
# https://plotly.com/python/marker-style/
```

**origin** : 1 - American, 2 - European, 3 - Japanese

```python
# Pair plot - hue: origin
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
    fig = sns.PairGrid(df, vars=df.columns[~df.columns.isin(['cylinders', 'origin', 'name'])].tolist(),
                       hue='origin', palette=sns.cubehelix_palette(3, as_cmap=False))
    plt.gcf().set_size_inches(12, 12)
    fig.map_diag(sns.histplot)
    fig.map_upper(sns.scatterplot)
    fig.map_lower(sns.kdeplot)
    fig.add_legend(ncol=5, loc=1, bbox_to_anchor=(0.5, 1.05), borderaxespad=0, frameon=False);
```

<div class="alert alert-block alert-info"><a id=''></a><b>Observations:</b><br>
- Cars of European (2) and Japanese (3) origin overlap on every criterion in the contour plots, whereas American cars (1) have a larger and distinct spread.<br>
- Clear distinctions can be seen between American and the other two carmakers in displacement, horsepower and weight.</div>

```python
# Frequency distribution of cylinders
df.cylinders.value_counts(sort=False)
```

```python
# Year-wise distribution of cars by cylinder count
cyl_year = pd.pivot_table(df, values='name', index='year', columns='cylinders',
                          aggfunc=len, fill_value=0)
plt.plot(cyl_year)
plt.gca().set(frame_on=False)
plt.gcf().set_size_inches(10, 4)
plt.legend(cyl_year.columns, bbox_to_anchor=(1.1, 0.8), borderaxespad=0, frameon=False);
cyl_year
```

###### ------------------------------------------------------------

### (f) Variables useful in predicting mpg

Except for acceleration, all the variables display some sort of relationship or trend with mpg, whether positive or negative.

_Positive_ : year<br>
_Negative_ : cylinders, displacement, horsepower, weight<br>
_Non-directional_ : origin<br>

They can be taken into account for predicting mpg, after adjusting for collinearity.
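One way to screen for that collinearity before modelling is with variance inflation factors, which are the diagonal of the inverse of the predictor correlation matrix. A numpy-only sketch on synthetic predictors (the dataset's quantitative columns would take their place):

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 nearly duplicates x1 (collinear), x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1,
                  'x2': x1 + rng.normal(scale=0.1, size=200),
                  'x3': rng.normal(size=200)})

# VIF_i = i-th diagonal element of the inverse of the correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(vif.round(1))  # x1 and x2 large (collinear), x3 near 1
```

A VIF well above ~5-10 flags a predictor that is largely explained by the others, which here would apply to the displacement/horsepower/weight trio noted in the pair plots.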